Listening to the World Improves Speech Command Recognition
We study transfer learning in convolutional network architectures applied to
the task of recognizing audio, such as environmental sound events and speech
commands. Our key finding is that not only is it possible to transfer
representations from an unrelated task like environmental sound classification
to a voice-focused task like speech command recognition, but also that doing so
improves accuracies significantly. We also investigate the effect of increased
model capacity on transfer learning for audio. We first validate, on two audio
datasets (UrbanSound8k and the newly released Google Speech Commands dataset),
the known result from computer vision that increasingly deep networks achieve
better accuracies. Then we propose a simple multiscale
input representation using dilated convolutions and show that it is able to
aggregate larger contexts and increase classification performance. Further, the
models trained using a combination of transfer learning and multiscale input
representations need only 40% of the training data to achieve accuracies
similar to those of a model trained from scratch on 100% of the training data. Finally,
we demonstrate a positive interaction effect for the multiscale input and
transfer learning, making a case for the joint application of the two
techniques. Comment: 8 pages
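The multiscale idea in the abstract above can be illustrated with a minimal sketch. It assumes a 1D signal, a single shared kernel, and dilation rates 1, 2 and 4; none of these choices come from the abstract, and the function names are hypothetical. It only shows how dilation widens the context each output sample sees without adding weights.

```python
def receptive_field(kernel_size, dilation):
    """Number of input samples spanned by one dilated kernel."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded, same-length 1D dilated cross-correlation."""
    # Offset of the first tap so the kernel is centred on position i.
    left = ((len(kernel) - 1) * dilation) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = i - left + j * dilation
            if 0 <= idx < len(signal):  # taps outside the signal read zero
                acc += weight * signal[idx]
        out.append(acc)
    return out


def multiscale(signal, kernel, dilations=(1, 2, 4)):
    """Stack one output channel per dilation rate (illustrative rates)."""
    return [dilated_conv1d(signal, kernel, d) for d in dilations]


if __name__ == "__main__":
    impulse = [0, 0, 0, 0, 1, 0, 0, 0, 0]
    for d, channel in zip((1, 2, 4), multiscale(impulse, [1, 1, 1])):
        print(d, receptive_field(3, d), channel)
```

With a three-tap kernel, the receptive field grows from 3 samples at dilation 1 to 9 samples at dilation 4, which is the sense in which a dilated input layer aggregates larger contexts.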
Learning Interpretable Style Embeddings via Prompting LLMs
Style representation learning builds content-independent representations of
author style in text. Stylometry, the analysis of style in text, is often
performed by expert forensic linguists and no large dataset of stylometric
annotations exists for training. Current style representation learning uses
neural methods to disentangle style from content to create style vectors;
however, these approaches result in uninterpretable representations,
complicating their usage in downstream applications like authorship attribution
where auditing and explainability are critical. In this work, we use prompting
to perform stylometry on a large number of texts to create a synthetic dataset
and train human-interpretable style representations we call LISA embeddings. We
release our synthetic stylometry dataset and our interpretable style models as
resources.
Open Knowledge Enrichment for Long-tail Entities
Knowledge bases (KBs) have gradually become a valuable asset for many AI
applications. While many current KBs are quite large, they are widely
acknowledged as incomplete, especially lacking facts of long-tail entities,
e.g., less famous persons. Existing approaches enrich KBs mainly by completing
missing links or filling missing values. However, they only tackle a part of
the enrichment problem and lack specific considerations regarding long-tail
entities. In this paper, we propose a full-fledged approach to knowledge
enrichment, which predicts missing properties and infers true facts of
long-tail entities from the open Web. Prior knowledge from popular entities is
leveraged to improve every enrichment step. Our experiments on synthetic
and real-world datasets and comparison with related work demonstrate the
feasibility and superiority of the approach. Comment: Accepted by the 29th
International World Wide Web Conference (WWW 2020)
Ranking and semi-supervised classification on large scale graphs using map-reduce
Label Propagation, a standard algorithm for semi-supervised classification, suffers from scalability issues involving memory and computation when used with large-scale graphs from real-world datasets. In this paper we approach Label Propagation as the solution to a system of linear equations, which can be implemented as a scalable parallel algorithm using the map-reduce framework. In addition to semi-supervised classification, this approach to Label Propagation allows us to adapt the algorithm for ranking on graphs and to derive the theoretical connection between Label Propagation and PageRank. We provide empirical evidence to that effect using two natural language tasks – lexical relatedness and polarity induction. The version of the Label Propagation algorithm presented here scales linearly in the size of the data with a constant main memory requirement, in contrast to the quadratic cost of both in traditional approaches.
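The linear-system view of Label Propagation described above can be sketched as follows. This is a minimal single-machine illustration, not the paper's implementation: it assumes the common fixed-point form F = alpha * P^T F + (1 - alpha) * Y, which is the Jacobi-style iteration for the system (I - alpha * P^T) F = (1 - alpha) Y, where P is the row-normalised edge-weight matrix and Y holds the seed labels. The function names, the damping value 0.85 and the toy graph are all assumptions. Each iteration is split into a map step (every node emits weighted scores to its neighbours) and a reduce step (every node sums what it received).

```python
from collections import defaultdict


def map_phase(graph, scores, alpha):
    """Map: each node emits a weighted share of its score to each neighbour."""
    for node, edges in graph.items():
        total = sum(edges.values())
        for neighbour, weight in edges.items():
            yield neighbour, alpha * scores[node] * weight / total


def reduce_phase(emitted, graph, seeds, alpha):
    """Reduce: sum the contributions per node and add the seed term."""
    sums = defaultdict(float)
    for node, value in emitted:
        sums[node] += value
    return {n: sums[n] + (1 - alpha) * seeds.get(n, 0.0) for n in graph}


def propagate(graph, seeds, alpha=0.85, iterations=100):
    """Iterate F <- alpha * P^T F + (1 - alpha) * Y for a fixed number of rounds."""
    scores = {n: seeds.get(n, 0.0) for n in graph}
    for _ in range(iterations):
        scores = reduce_phase(map_phase(graph, scores, alpha), graph, seeds, alpha)
    return scores


if __name__ == "__main__":
    # Toy chain a - b - c - d with one positive and one negative seed.
    chain = {
        "a": {"b": 1.0},
        "b": {"a": 1.0, "c": 1.0},
        "c": {"b": 1.0, "d": 1.0},
        "d": {"c": 1.0},
    }
    print(propagate(chain, {"a": 1.0, "d": -1.0}))
```

If the seed vector Y is replaced by a uniform vector, the same update is the familiar PageRank iteration with damping factor alpha, which is one way to see the connection between the two algorithms that the abstract mentions.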
Learning Efficient Representations for Fake Speech Detection
Synthetic speech, or “fake speech”, that matches personal vocal traits has become better and cheaper due to advances in deep learning-based speech synthesis and voice conversion approaches. The increased accessibility of synthetic speech systems and their growing misuse highlight the critical need to build countermeasures. Furthermore, new synthesis models evolve all the time, and the efficacy of previously trained detection models on these unseen attack vectors is poor. In this paper, we focus on: 1) How can we build highly accurate, yet parameter- and sample-efficient models for fake speech detection? 2) How can we rapidly adapt detection models to new sources of fake speech? We present four parameter-efficient convolutional architectures for fake speech detection with best detection F1 scores of around 97 points on a large dataset of fake and bona fide speech. We show how the fake speech detection task naturally lends itself to a novel multi-task problem, further improving F1 scores for a mere 0.5% increase in model parameters. Our multi-task setting also helps in data-sparse situations, commonplace in adversarial settings. We investigate an alternative approach to the data-sparsity problem using transfer learning and show that it is possible to meet purely supervised detection performance for unseen attack vectors with as little as 6.25% of the training data. This is the first known application of transfer learning in adversarial settings for speech. Finally, we show how well our transfer learning approach adapts in an instance-efficient way to new attack vectors using the Real-Time Voice Cloning toolkit. We exceed the purely supervised detection performance (99.18 F1) with as little as 6.25% of the data.